Experimental Results on the Native Language Identification Shared Task

نویسندگان

  • Amjad Abu-Jbara
  • Rahul Jha
  • Eric Morley
  • Dragomir R. Radev
چکیده

We present a system for automatically identifying the native language of a writer. We experiment with a large set of features and train them on a corpus of 9,900 essays written in English by speakers of 11 different languages. our system achieved an accuracy of 43% on the test data, improved to 63% with improved feature normalization. In this paper, we present the features used in our system, describe our experiments and provide an analysis of our results.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Report on the 2017 Native Language Identification Shared Task

Native Language Identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is typically framed as a classification task where the set of L1s is known a priori. Two previous shared tasks on NLI have been organized where the aim was to identify the L1 of learners of English based on essays (2...

متن کامل

A Report on the First Native Language Identification Shared Task

Native Language Identification, or NLI, is the task of automatically classifying the L1 of a writer based solely on his or her essay written in another language. This problem area has seen a spike in interest in recent years as it can have an impact on educational applications tailored towards non-native speakers of a language, as well as authorship profiling. While there has been a growing bod...

متن کامل

LIMSI's participation to the 2013 shared task on Native Language Identification

This paper describes LIMSI’s participation to the first shared task on Native Language Identification. Our submission uses a Maximum Entropy classifier, using as features character and chunk n-grams, spelling and grammatical mistakes, and lexical preferences. Performance was slightly improved by using a twostep classifier to better distinguish otherwise easily confused native languages.

متن کامل

LIGA and Syllabification Approach for Language Identification and Back Transliteration : Shared Task Report by DAIICT

This paper aims to address the solution for the Subtask 1 of Shared Task on transliterated search,a task in FIRE ’14. The task addresses the problem of data containing English words and transliterated words of Indian languages in English.The task calls for language identification and subsequent back transliteration into the native Indian scripts.The system proposed herewith implements Language ...

متن کامل

The Story of the Characters, the DNA and the Native Language

This paper presents our approach to the 2013 Native Language Identification shared task, which is based on machine learning methods that work at the character level. More precisely, we used several string kernels and a kernel based on Local Rank Distance (LRD). Actually, our best system was a kernel combination of string kernel and LRD. While string kernels have been used before in text analysi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013